perm filename PROPO3[7,ALS] blob sn#051784 filedate 1973-07-03 generic text, type T, neo UTF8
00010									June 26 1973
00020	
00030			 A Proposal for Speech Understanding Research
00040	
00050	
00060		It is proposed that the work on speech recognition that is
00070	now under way in the A.I. project at Stanford University be continued
00080	and extended with broadened aims in the field
00090	of speech understanding. This work gives considerable promise both of
00100	solving some of the immediate problems that beset speech
00110	understanding research and of providing a basis for future advances.
00120	
00130		It is further proposed that this work be more closely tied to
00140	the ARPA Speech Understanding Research effort than it has been in the
00150	past and that it have as its express aim the study and application to
00160	speech recognition of a machine learning process that has proved
00170	highly successful in another application and that has already been
00180	tested to a limited extent in speech recognition. The machine
00190	learning process offers both an automatic training scheme and the
00200	inherent ability of the system to adapt to various speakers and
00210	dialects. Speech recognition via machine learning represents a global
00220	approach to the speech recognition problem and can be incorporated
00230	into a wide class of limited vocabulary systems.
00240	
00250		Finally, we would propose accepting responsibility for keeping
00260	other ARPA projects supplied with operating versions of the best
00270	current programs that we have developed. The availability of the high
00280	quality front end that the signature table approach provides would 
00290	enable designers of the various over-all systems
00300	to test the relative performance of the top-down portions of their
00310	systems without having to make allowances for the deficiencies
00320	of their currently available front ends. Indeed, if the signature table
00330	scheme can be made simple enough to compete on a time basis (and we
00340	believe that it can) then it may replace the other front end
00350	schemes that are currently in favor.
00360	
00370		Stanford University is well suited as the site for such work,
00380	having both the facilities for this work and a staff of people with
00390	experience and interest in machine learning, phonetic analysis, and
00400	digital signal processing. The staff at present consists of the
00410	proposed Principal Investigator, Arthur L. Samuel; one post-doctoral
00420	staff member, Ravindra Thosar, who
00430	has worked on speech recognition and synthesis in India; a
00440	second member, Dr. Neil Miller, who has had considerable signal-processing
00450	experience; and a few graduate students. It is anticipated that a staff
00460	of not more than 3 full time members with the help of 3 or 4 graduate
00470	students could mount a meaningful program, which should be funded for a
00480	minimum of two years to ensure continuity of effort.
00490	We would expect to demonstrate the utility of the
00500	Signature Table approach within this time span and to provide a working
00510	system that could be used as the front end for any of the
00520	speech understanding systems that are currently under
00530	development or are being planned.
00550	
00560		Ultimately we would
00570	like to have a system capable of understanding speech from an
00580	unlimited domain of discourse and with an unknown speaker. It seems not
00590	unreasonable to expect the system to deal with this situation very
00600	much as people do when they adapt their understanding processes to
00610	the speaker's idiosyncrasies during the conversation. The signature table
00620	method gives promise of contributing toward the solution of this
00630	problem as well as being a
00640	possible answer to some of the more immediate problems.
00650	
00660		The initial thrust of the proposed work would be toward the
00670	development of adaptive learning techniques, using the signature
00680	table method and some more recent variants and extensions of this
00690	basic procedure. We have already demonstrated the usefulness of this
00700	method for the initial assignment of significant features to the
00710	acoustic signals. One of the next steps will be to extend the method
00720	to include acoustic-phonetic probabilities in the decision process.
00730	
00740		Still another aspect to be studied would be the amount of
00750	preprocessing that should be done and the desired balance between
00760	bottom-up and top-down approaches. It is fairly obvious that
00770	decisions of this sort should ideally be made dynamically depending
00780	upon the familiarity of the system with the domain of
00790	discourse and with the characteristics of the speaker.
00800	Compromises will undoubtedly have to be made in any immediately
00810	realizable system but we should understand better than we now do the
00820	limitations on the system that such compromises impose.
00830	
00840		It may be well at this point to describe the general
00850	philosophy that has been followed in the work that is currently under
00860	way and the results that have been achieved to date. We have been
00870	studying elements of a speech recognition system that is not
00880	dependent upon the use of a limited vocabulary and that can recognize
00890	continuous speech by a number of different speakers.
00900	
00910		Such a system should be able to function successfully either
00920	without any previous training for the specific speaker in question or
00930	after a short training session in which the speaker would be asked to
00940	repeat certain phrases designed to train the system on those phonetic
00950	utterances that seemed to depart from the previously learned norm. In
00960	either case it is believed that some automatic or semi-automatic
00970	training system should be employed to acquire the data that is used
00980	for the identification of the phonetic information in the speech. We
00990	believe that this can best be done by employing a modification of the
01000	signature table scheme previously described. A brief review of this
01010	earlier form of signature table is given in Appendix 1.
01020	
01030		The over-all system is envisioned as one in which the more or
01040	less conventional method of separating the input speech into
01050	short time slices is used, with some sort of frequency analysis
01060	(homomorphic, LPC, or the like) done for each slice. We then interpret this
01070	information in terms of significant features by means of a set of
01080	signature tables. At this point we define longer sections of the
01090	speech, called segments, which are obtained by grouping together varying
01100	numbers of the original slices on the basis of their similarity. This
01110	then takes the place of other forms of initial segmentation. Having
01120	identified a series of segments in this way we next use another set of
01130	signature tables to extract information from the sequence of segments
01140	and combine it with a limited amount of syntactic and semantic
01150	information to define a sequence of phonemes.
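The slice-grouping step just described can be sketched in a few lines of Python; the feature vectors, the Euclidean distance measure and the similarity threshold here are hypothetical stand-ins for illustration, not the parameters of the actual system.

```python
# Hypothetical sketch of grouping time slices into segments by similarity.
# A slice close enough to the running mean of the current segment is merged
# into it; a large jump in the feature vector starts a new segment.

def group_slices(slices, threshold=1.0):
    """Group consecutive feature vectors (lists of floats) into segments."""
    segments = []
    for vec in slices:
        if segments:
            mean = segments[-1]["mean"]
            dist = sum((a - b) ** 2 for a, b in zip(vec, mean)) ** 0.5
            if dist < threshold:
                seg = segments[-1]
                n = seg["count"]
                # fold the new slice into the segment's running mean
                seg["mean"] = [(m * n + v) / (n + 1) for m, v in zip(mean, vec)]
                seg["count"] = n + 1
                continue
        segments.append({"mean": list(vec), "count": 1})
    return segments

# Two clusters of slices -> two segments of two slices each.
print(group_slices([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]))
```

The number of slices per segment varies with the signal, which is what lets this grouping take the place of a fixed initial segmentation.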
01160	
01170		While it would be possible to extend this bottom up approach
01180	still further, it seems reasonable to break off at this point and
01190	revert to a top down approach from here on. The real difference in
01200	the overall system would then be that the top down analysis would
01210	deal with the outputs from the signature table section as its
01220	primitives rather than with the outputs from the initial measurements
01230	either in the time domain or in the frequency domain. In the case of
01240	inconsistencies the system could either refer to the second choices
01250	retained within the signature tables or if need be could always go
01260	clear back to the input parameters. The decision as to how far to
01270	carry the initial bottom up analysis must depend upon the relative
01280	cost of this analysis both in complexity and processing time and the
01290	certainty with which it can be performed as compared with the costs
01300	associated with the rest of the analysis and the certainty with which
01310	it can be performed, taking due notice of the costs in time of
01320	recovering from false starts.
01330	
01340		Signature tables can be used to perform four essential
01350	functions that are required in the automatic recognition of speech.
01360	These functions are: (1) the elimination of superfluous and
01370	redundant information from the acoustic input stream, (2) the
01380	transformation of the remaining information from one coordinate
01390	system to a more phonetically meaningful coordinate system, (3) the
01400	mixing of acoustically derived data with syntactic, semantic and
01410	linguistic information to obtain the desired recognition, and (4) the
01420	introduction of a learning mechanism.
01430	
01440		The following three advantages emerge from this method of
01450	training and evaluation.
01460		1) Essentially arbitrary inter-relationships between the
01470	input terms are taken into account by any one table. The only loss of
01480	accuracy is in the quantization.
01490		2) The training is a very simple process of accumulating
01500	counts. The training samples are introduced sequentially, and hence
01510	simultaneous storage of all the samples is not required.
01520		3) The process linearizes the storage requirements in the
01530	parameter space.
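Advantage (3) can be made concrete with a back-of-the-envelope count using the checkers configuration described in Appendix 1 (27 terms quantized to 7, 5 and 3 levels; 5-level intermediate outputs): the hierarchy of tables needs a few thousand entries where a single joint table over all terms would need an astronomical number.

```python
# Storage comparison for the Appendix 1 checkers configuration.
first_level = 9 * (7 * 5 * 3)   # 9 tables of 105 entries each
second_level = 3 * (5 ** 3)     # 3 tables over three 5-level inputs
third_level = 1 * (5 ** 3)      # 1 table over three 5-level inputs
hierarchical = first_level + second_level + third_level

joint = (7 * 5 * 3) ** 9        # one table over all 27 terms at once

print(hierarchical)             # 1445 entries
print(joint)                    # 105**9, about 1.55e18 entries
```

The hierarchy trades a (quantized) loss of inter-table interactions for storage that grows roughly linearly with the number of tables.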
01540	
01550		The signature tables, as used in speech recognition, must be
01560	particularized to allow for the multi-category nature of the output.
01570	Several forms of tables have been investigated. Details of the current
01580	system are given in Appendix 2. For some early results see
01590	SUR Note 43 "Some Preliminary Experiments in Speech Recognition
01600	Using Signature Tables" by R.B.Thosar and A.L.Samuel.
01620	
01630		Work is currently under way on a major refinement of the
01640	signature table approach which adopts a somewhat more rigorous
01650	procedure. Preliminary results with this scheme indicate that a
01660	substantial improvement has been achieved. This effort is described in
01670	a recent report SUR Note 81 on "Estimation of Probability Density Using
01680	Signature Tables for Application to Pattern Recognition," by 
01690	R.B.Thosar.
01700	
01720		We are currently involved in work on a segmentation
01730	procedure which has already demonstrated its ability to compete with other
01740	proposed segmentation systems, even when used to process speech from 
01750	speakers whose utterances  were not used during the training
01760	sequence.
     

00010	FACILITIES
00020	
00030	The computer  facilities  of  the  Stanford  Artificial  Intelligence
00040	Laboratory include the following equipment.
00050	
00060	Central Processors:  Digital Equipment Corporation PDP-10 and PDP-6
00070	
00080	Primary Store:       65K words of 1.7 microsecond DEC Core
00090		             65K words of 1 microsecond Ampex Core
00100	                     131K words of 1.6 microsecond Ampex Core
00110	
00120	Swapping Store:      Librascope disk (5 million words, 22 million
00130	                     bits/second transfer rate)
00140	
00150	File Store:          IBM 3330 disc file, 6 spindles (leased)
00160	
00170	Peripherals:         4 DECtape drives, 2 mag tape drives, line printer,
00180		             Calcomp plotter, Xerox Graphics Printer
00190	
00200	Communications
00210	    Processor:	     BBN IMP (Honeywell DDP-516) connected to the
00220			     ARPA network.
00230	
00240	Terminals:           58 TV displays, 6 III displays, 3 IMLAC displays,
00250		             1 ARDS display, 15 Teletype terminals
00260	
00270	Special  Equipment:  Audio  input  and  output  systems, hand-eye
00280	                     equipment (2 TV cameras, 3 arms), remote-
00290	                     controlled cart
     

00010	   		RESEARCH GRANT BUDGET
00020			
00030			TWO YEARS BEGINNING OCTOBER 1, 1973
00040	
00050	
00060	BUDGET CATEGORY					YEAR 1	YEAR 2
00070	-----------------------------------------------------------------
00080	I. SALARIES & WAGES:
00090		
00100		Samuel, A.,
00110		Senior Research Associate
00120		Principal Investigator, 75%		 20,000	 20,000
00130	
00140		------,
00150		Research Associate			 14,520	 14,520
00160	
00170		Miller, N.,
00180		Research Associate			 13,680	 13,680
00190	
00200		------,
00210		Student Research Assistant,
00220		50% academic year, 100% summer		  4,914	  5,070
00230	
00240		------,
00250		Student Research Assistant,
00260		50% academic year, 100% summer		  4,914	  5,070
00270	
00280		Reserve for Salary Increases
00290		@ 5.5% per year				  2,901	  5,980
00300							-------	-------
00310	
00320		TOTAL SALARIES AND WAGES		$60,929 $64,320
00330	
00340	II. STAFF BENEFITS:
00350	
00360		17.0% 10-1-73 to 8-31-74		  9,495
00370		18.3% 9-1-74 to 8-31-75			    929  10,790
00380		19.3% 9-1-75 to 9-30-75				  1,034
00390							-------	-------
00400		TOTAL STAFF BENEFITS			$10,424 $11,824
00410	
00420	III. TRAVEL:
00430	
00440		Domestic -
00450			Local		150
00460			East Coast	450
00470					---
00480							   $600    $600
00490	
00500	IV.  EXPENDABLE MATERIALS & SERVICES:
00510	
00520		A. Telephone Service	480
00530		B. Office Supplies	600
00540					---
00550							 $1,080  $1,080
00560	
00570	V.  PUBLICATIONS COST:
00580	
00590		2 Papers @ 500 ea.			 $1,000  $1,000
00600							------- -------
00610	
00620	VI. TOTAL DIRECT COSTS:
00630	
00640		(Items I through V)			$74,033 $78,824
00650	
00660	VII. INDIRECT COSTS:
00670	
00680		On Campus - 47% of NTDC			$34,796 $37,047
00690	
00700							-------	-------
00710	VIII. TOTAL COSTS:
00720	
00730		(Items VI + VII)		       $108,829 $115,871          
00740						       -------- --------
     

00010	COGNIZANT PERSONNEL
00020	
00030	
00040	        For contractual matters:
00050	
00060	                Office of the Research Administrator
00070	                Stanford University
00080	                Stanford, California 94305
00090	
00100	                Telephone: (415) 321-2300, ext. 3330
00110	
00120	        For technical and scientific matters regarding this proposal:
00130	
                Arthur L. Samuel
00150	                Computer Science Department
00160	                Stanford University
00170	                Stanford, California 94305
00180	
00190	                Telephone: (415) 321-2300, ext. 4971
00200	
00210	        For administrative matters, including questions relating
00220	        to the budget or property acquisition:
00230	
00240	                Mr. Lester D. Earnest
00250	                Computer Science Department
00260	                Stanford University
00270	                Stanford, California 94305
00280	
00290	                Telephone: (415) 321-2300, ext. 4971
     

00010	
00020			Appendix 1
00030	
00040		The early form of a signature table
00050	
00060		For those not familiar with the use of signature tables as
00070	used by Samuel in programs which played the game of checkers, the
00080	concept is best illustrated (Fig.1) by an arrangement of tables used
00090	in the program. There are 27 input terms. Each term evaluates a
00100	specific aspect of a board situation and is quantized into a
00110	limited but adequate range of values (7, 5 and 3 in this case). The
00120	terms are divided into 9 sets with 3 terms each, forming the 9 first
00130	level tables. Outputs from the first level tables are quantized to 5
00140	levels and combined into 3 second level tables and, finally, into one

00150	third-level table whose output represents the figure of merit of the
00160	board in question.
00170	
00180		A signature table has an entry for every possible combination
00190	of input values. Thus there are 7*5*3 or 105 entries in each of
00200	the first level tables. Training consists of accumulating two counts
00210	for each entry during a training sequence. Count A is incremented
00220	when the current input vector represents a preferred move and count D
00230	is incremented when it is not the preferred move. The output from the
00240	table is computed as a correlation coefficient
00250	 			C=(A-D)/(A+D).
00260		The figure of merit for a board is simply the
00270	coefficient obtained as the output from the final table.
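The table mechanics just described can be sketched as follows; the class structure and the entry-indexing scheme are illustrative assumptions, but the A/D counts and the output C=(A-D)/(A+D) follow the text above.

```python
# Sketch of one first-level signature table from Appendix 1:
# input ranges (7, 5, 3) -> 105 entry lines, each holding two counts.

class SignatureTable:
    def __init__(self, ranges=(7, 5, 3)):
        self.ranges = ranges
        n = 1
        for r in ranges:
            n *= r
        self.A = [0] * n        # count A: input vector was the preferred move
        self.D = [0] * n        # count D: input vector was not preferred

    def index(self, inputs):
        """Map a quantized input vector to its entry line (mixed-radix)."""
        idx = 0
        for value, r in zip(inputs, self.ranges):
            idx = idx * r + value
        return idx

    def train(self, inputs, preferred):
        i = self.index(inputs)
        if preferred:
            self.A[i] += 1
        else:
            self.D[i] += 1

    def output(self, inputs):
        """Correlation coefficient C = (A - D) / (A + D) for this entry."""
        i = self.index(inputs)
        a, d = self.A[i], self.D[i]
        return (a - d) / (a + d) if a + d else 0.0
```

Seen three times as a preferred move and once as a non-preferred one, an entry would report C = (3-1)/(3+1) = 0.5.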
     

00010			Appendix 2
00020	
00030		Initial Form of Signature Table for Speech Recognition
00040	
00050		The signature tables, as used in speech recognition, must be
00060	particularized to allow for the multi-category nature of the output.
00070	Several forms of tables have been investigated. The initial form
00080	tested and used for the data presented in the attached paper uses
00090	tables consisting of two parts, a preamble and the table proper. The
00100	preamble contains: (1) space for saving a record of the current and
00110	recent output reports from the table, (2) identifying information as
00120	to the specific type of table, (3) a parameter that identifies the
00130	desired output from the table and that is used in the learning
00140	process, (4) a gating parameter specifying the input that is to be
00150	used to gate the table, (5) the sign of the gate,
00160	(6) the gating level to be used, and (7)
00170	parameters that identify the sources of the normal inputs to the
00180	table.
00190	
00200		All inputs are limited in range and specify either the
00210	absolute level of some basic property or more usually the probability
00220	of some property being present. These inputs may be from the original
00230	acoustic input or they may be the outputs of other tables. If from
00240	other tables they may be for the current time step or for earlier
00250	time steps, (subject to practical limits as to the number of time
00260	steps that are saved).
00270	
00280		The output, or outputs, from each table are similarly limited
00290	in range and specify, in all cases, a probability that some
00300	particular significant feature, phonette, phoneme, word segment, word
00310	or phrase is present.
00320	
00330		We are limiting the range of inputs and outputs to values
00340	specified by 3 bits and the number of entries per table to 64,
00350	although this choice of values is a matter to be determined by
00360	experiment. We are also providing for any of the following input
00370	combinations: (1) one input of 6 bits, (2) two inputs of 3 bits each,
00380	(3) three inputs of 2 bits each, and (4) six inputs of 1 bit each.
00390	The uses to which these different forms are put will be described
00400	later.
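All four combinations fill the same 6-bit entry index (64 lines per table). A minimal sketch of the packing (the function name and the field order are our assumptions):

```python
# Pack equal-width input fields into a single 6-bit signature-table index.

def pack(inputs, bits_each):
    """Pack len(inputs) fields of bits_each bits into one 6-bit index."""
    assert len(inputs) * bits_each == 6, "fields must fill exactly 6 bits"
    idx = 0
    for v in inputs:
        assert 0 <= v < (1 << bits_each), "input exceeds its field width"
        idx = (idx << bits_each) | v
    return idx

print(pack([5, 3], 3))              # two 3-bit inputs  -> 0b101011 = 43
print(pack([3, 2, 1], 2))           # three 2-bit inputs -> 0b110110 + 1 = 57
print(pack([1, 0, 1, 1, 0, 1], 1))  # six 1-bit inputs  -> 0b101101 = 45
```

Whatever the combination, the table body stays a flat array of 64 entry lines.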
00410	
00420		The body of each table contains entries corresponding to
00430	every possible combination of the allowed input parameters. Each
00440	entry in the table actually consists of several parts. There are
00450	fields assigned to accumulate counts of the occurrences of incidents
00460	in which the specifying input values coincided with the different
00470	desired outputs from the table as found during previous learning
00480	sessions and there are fields containing the summarized results of
00490	these learning sessions, which are used as outputs from the table.
00500	The outputs from the tables can then express to the allowed accuracy
00510	all possible functions of the input parameters.
00520	
00530	Operation in the Training Mode
00540	
00550		When operating in the training mode the program is supplied
00560	with a sequence of stored utterances with accompanying phonetic
00570	transcriptions. Each sample of the incoming speech signal is
00580	analysed (Fourier transforms or inverse filter equivalent) to obtain
00590	the necessary input parameters for the lowest level tables in the
00600	signature table hierarchy. At the same time reference is made to a
00610	table of phonetic "hints" which prescribes the desired outputs from
00620	each table corresponding to all possible phonemic inputs. The
00630	signature tables are then processed.
00640	
00650		The processing of each table is done in two steps, one
00660	process at each entry to the table and the second only periodically.
00670	The first process consists of locating a single entry line within the
00680	table as specified by the inputs to the table and adding a 1 to the
00690	appropriate field to indicate the presence of the property specified
00700	by the hint table as corresponding to the phoneme specified in the
00710	phonemic transcription. At this time a report is also made as to the
00720	table's output as determined from the averaged results of previous
00730	learning so that a running record may be kept of the performance of
00740	the system. At periodic intervals all tables are updated to
00750	incorporate recent learning results. To make this process easily
00760	understandable, let us restrict our attention to a table used to
00770	identify a single significant feature, say Voicing. The hint table
00780	will identify whether or not the phoneme currently being processed is
00790	to be considered voiced. If it is voiced, a 1 is added to the "yes"
00800	field of the entry line located by the normal inputs to the table. If
00810	it is not voiced, a 1 is added to the "no" field. At updating time
00820	the output that this entry will subsequently report is determined by
00830	dividing the accumulated sum in the "yes" field by the sum of the
00840	numbers in the "yes" and the "no" fields, and reporting this quantity
00850	as a number in the range from 0 to 7. Actually the process is a bit
00860	more complicated than this and it varies with the exact type of table
00870	under consideration, as reported in detail in appendix B. Outputs
00880	from the signature tables are not probabilities, in the strict sense,
00890	but are the statistically-arrived-at odds based on the actual
00900	learning sequence.
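A much-simplified sketch of this yes/no accumulation and the 0-to-7 output quantization for a single-feature table; the class name, the entry indexing and the exact rounding rule are our assumptions, not the program's (the text notes the real process is more complicated and varies with the table type).

```python
# Simplified training-mode update for a single-feature (e.g. Voicing) table:
# per entry line, accumulate "yes"/"no" counts, then at periodic update time
# quantize yes/(yes+no) into the 3-bit output range 0..7.

class VoicingTable:
    def __init__(self, entries=64):
        self.yes = [0] * entries
        self.no = [0] * entries
        self.out = [0] * entries    # 3-bit outputs reported between updates

    def train(self, entry, voiced):
        """Add 1 to the field selected by the hint for this entry line."""
        if voiced:
            self.yes[entry] += 1
        else:
            self.no[entry] += 1

    def update(self):
        """Periodic update: fold recent counts into the reported outputs."""
        for i in range(len(self.out)):
            total = self.yes[i] + self.no[i]
            if total:
                self.out[i] = round(7 * self.yes[i] / total)
```

An entry trained voiced three times and unvoiced once would thereafter report round(7 * 3/4) = 5 on the 0-to-7 scale.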
00910	
00920		The preamble of the table has space for storing twelve past
00930	outputs. An input to a table can be delayed to that extent. This table
00940	relates the outcomes of previous events to the present hint (the
00950	learning input). A certain amount of context-dependent learning is thus
00960	possible, with the limitation that the specified delays are constant.
00970	
00980		The interconnected hierarchy of tables forms a network which
00990	runs incrementally, in steps synchronous with the time window over
01000	which the input signal is analysed. The present window width is set at
01010	12.8 ms. (256 points at 20 K samples/sec.) with an overlap of 6.4 ms. Inputs
01020	to this network are the parameters abstracted from the frequency
01030	analyses of the signal, and the specified hint. The outputs of the
01040	network could be either the probability attached to every phonetic
01050	symbol or the output of a table associated with a feature such as
01060	voiced, vowel, etc. The point to be made is that the output generated
01070	for a sample is essentially independent of its contiguous
01080	samples. The dependency achieved by using delays in the inputs is
01090	invisible to the outputs. The outputs thus report the best estimate of
01100	what the current acoustic input is, with no relation to past
01110	outputs. Relating the successive outputs along the time dimension is
01120	realised by counters.
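The window arithmetic above can be checked directly; this small sketch (the variable and function names are ours) computes the window and step durations and enumerates the window start points over a signal.

```python
# Framing arithmetic for the analysis described above:
# 256-point windows at 20 K samples/sec, stepped with 50% overlap.

RATE = 20_000           # samples per second
WINDOW = 256            # points per analysis window
HOP = WINDOW // 2       # 6.4 ms overlap -> 128-sample step

def frame_starts(n_samples):
    """Starting sample index of each full window in an n_samples signal."""
    return list(range(0, n_samples - WINDOW + 1, HOP))

print(WINDOW / RATE * 1000)     # 12.8 ms window width
print(HOP / RATE * 1000)        # 6.4 ms step between windows
print(frame_starts(512))        # windows in a 512-sample stretch
```

Each start point marks one synchronous step of the signature-table network.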
01130	
01140	The Use of COUNTERS
01150	
01160		The transition from initial sample space to segment space is
01170	made possible by means of COUNTERS, which are summed and reinitiated
01180	whenever their inputs cross specified threshold values, being
01190	triggered on when the input exceeds the threshold and off when it
01200	falls below. Momentary spikes are eliminated by specifying time
01210	hysteresis, the number of consecutive samples for which the input
01220	must be above the threshold. The output of a counter provides
01230	information about starting time, duration and average input for the
01240	period it was active.
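A hypothetical rendering of one COUNTER follows; the threshold trigger, the time hysteresis and the (start, duration, average) record come from the description above, while the function name and the exact reset rules are our assumptions.

```python
# One COUNTER: triggers on after `hysteresis` consecutive samples above the
# threshold, off after the same number below, and reports (start, duration,
# average input) for each active period. Momentary spikes shorter than the
# hysteresis are thereby eliminated.

def run_counter(inputs, threshold, hysteresis):
    records = []
    active = False
    above_run = below_run = 0
    start = 0
    for t, x in enumerate(inputs):
        if x > threshold:
            above_run += 1
            below_run = 0
        else:
            below_run += 1
            above_run = 0
        if not active and above_run >= hysteresis:
            active = True
            start = t - hysteresis + 1      # first sample of the run
        elif active and below_run >= hysteresis:
            active = False
            end = t - hysteresis + 1        # first sample below threshold
            vals = inputs[start:end]
            records.append((start, end - start, sum(vals) / len(vals)))
    if active:                              # still on at end of signal
        vals = inputs[start:]
        records.append((start, len(inputs) - start, sum(vals) / len(vals)))
    return records
```

With a hysteresis of 2 samples, a single-sample spike never triggers the counter, while a sustained run is reported with its start time, duration and mean input.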
01250	
01260		Since a counter can reference a table at any level in the
01270	hierarchy of tables, it can reflect any desired degree of information
01280	reduction. For example, a counter may be set up to show a section of
01290	speech to be a vowel, a front vowel or the vowel /I/. The counters can
01300	be looked upon as representing a mapping of parameter-time space into a
01310	feature-time space or, at a higher level, a symbol-time space. It may be
01320	useful to carry along the feature information as a back up in those
01330	situations where the symbolic information is not acceptable to
01340	syntactic or semantic interpretation.
01350	
01360		In the same manner as the tables, the counters run completely
01370	independently of each other. In a recognition run the counters may
01380	overlap in arbitrary fashion, may leave gaps where no counter has
01390	been triggered, or may not line up nicely. A properly segmented output,
01400	where the consecutive sections are in time sequence and are neatly
01410	labelled, is essential for further processing. This is achieved by
01420	registering the instants when the counters are triggered or
01430	terminated to form time slices called segments.
01440	
01450		An event is the period between successive activation or
01460	termination of any counter. An event shorter than a specified time is
01470	merely ignored. A record of event durations and up to three active
01480	counters, ordered according to their probability, is maintained.
01490	
01500		An event resulting from the processing described so far
01510	represents a phonette - one of the basic speech categories defined as
01520	hints in the learning process. It is only an estimate of closeness to
01530	a speech category, based on past learning. Also, each category has a
01540	more-or-less stationary spectral characterisation. Thus a category may
01550	have a phonemic equivalent, as in the case of vowels; it may be
01560	common to a phoneme class, as for the voiced or unvoiced stop gaps; or it
01570	may be subphonemic, as a T-burst or a K-burst. The choices are based on
01580	acoustic expediency, i.e. optimisation of the learning, rather than
01590	any linguistic considerations. However, a higher-level interpretive
01600	program may best operate on inputs resembling a phonemic
01610	transcription. The contiguous segments may be coalesced into phoneme-like
01620	units using dyadic or triadic probabilities and acoustic-phonetic
01630	rules particular to the system. For example, a period of silence
01640	followed by a type of burst or a short friction may be combined to
01650	form the corresponding stop. A short friction or a burst following a
01660	nasal or a lateral may be called a stop even if the silence period is
01670	short or absent. Clearly these rules must be specific to the system,
01680	based on the confidence with which durations and phonette categories
01690	are recognised.
     

00010			Appendix 3
00020		SPEECH RESEARCH AT STANFORD UNIVERSITY
00030	
00040		Efforts to establish a vocal communication link with a
00050	digital computer have been underway at Stanford since 1963.
00060	These efforts have been primarily concerned with four areas of
00070	research. First, basic research in extracting phonemic and
00080	linguistic information from speech waveforms has been pursued.
00090	Second, the application of automatic learning processes has
00100	been investigated. Third, the use of syntax and semantics to
00110	aid speech recognition has been explored. Finally,
00120	the application of speech recognition systems to control other
00130	processes developed at the Artificial Intelligence Facility has
00140	been carried out. These efforts have been carried out in
00150	parallel with varying emphasis on particular factors at
00160	different times. None of the facets of this research has been
00170	solved completely. However, each limited success has provided
00180	insight and direction which opened a wealth of challenging,
00190	state-of-the-art research projects.
00200	
00210		The  fruits  of Stanford's speech research program were
00220	first seen in October 1964 when Raj Reddy  published  a  report
00230	describing  his  preliminary  investigations on the analysis of
00240	speech waveforms [1]. This report described the initial digital
00250	processes  developed  for  analyzing  waveforms  of  vowels and
00260	consonants, fundamental  frequency,  and  formants.       These
00270	processes were used as the basis for a simple vowel recognition
00280	system and synthesis of sounds.
00290	
00300	By 1966 Reddy had built a much larger system which  obtained  a
00310	phonemic  transcription  and  which  achieved  segmentation  of
00320	connected  phrases  utilizing  hypotheses  testing  [2].   This
00330	system  represented  a  significant  contribution towards speech
00340	sound segmentation [3]. This system operated on a subset of the
00350	speech of a single cooperative speaker.
00360	
00370		In  1967  Reddy and his students refined several of
00380	his processes and published  papers  on  phoneme  grouping  for
00390	speech  recognition  [4],  pitch period determination of speech
00400	sounds [5], and computer recognition of connected  speech  [6].
00410	At this time Reddy was considering the introduction of learning
00420	into his processes at several stages.  He was also  supervising
00430	several  related  student projects including limited vocabulary
00440	speech  recognition,  a   phoneme   string   to   word   string
00450	transcription   program,   a  syllable  junction  program,  and
00460	telephone speech recognition.
     

00010		1968  was  an  extremely  productive year for Professor
00020	Reddy and his speech group.  Pierre Vicens published  a  report
00030	on  preprocessing  for  speech  analysis [7]; Reddy published a
00040	paper on the computer transcription of  phonemic  symbols  [8];
00050	Reddy and Ann Robinson published a paper on phoneme-to-grapheme
00060	translation of English [9]; Reddy and Vicens published a  paper
00070	on  procedures  for  segmentation of connected speech [10]; and
00080	Reddy presented a paper in Japan on consonantal clustering  and
00090	connected  speech  recognition [11].  In addition to this basic
00100	speech research, a paper by John McCarthy, Lester Earnest, Raj
00110	Reddy,  and  Pierre Vicens was presented at the 1968 Fall Joint
00120	Computer Conference entitled "A Computer With Hands, Eyes,  and
00130	Ears"  which,  in  part,  described  the  vocal  control of the
00140	artificial arm developed at Stanford [12].
00150	
00160		By 1969 the Stanford developed  speech  processes  were
00170	successfully  segmenting and parsing continuous utterances from
00180	a restricted syntax.       Pierre Vicens produced a  report  on
00190	aspects  of  speech  recognition by computer which investigated
00200	the techniques and methodologies which are useful in  achieving
00210	close  to  real-time recognition of speech [13].    In March of
00220	1969, Raj Reddy, Dave Espar, and Art Eisenson produced  a  16mm
00230	color  movie with sound entitled "Hear Here".         This film
00240	described the state of the speech  recognition  project  as  of
00250	Spring, 1969.  In addition, Raj Reddy completed a report on the
00260	use of environmental, syntactic, and probabilistic  constraints
00270	in  vision and speech [14] and Reddy and R.B.    Neely reported
00280	their research  on  the  contextual  analysis  of  phonemes  of
00290	English [15].
00300	
00310		In  1970,  a paper was presented by Raj Reddy, L.    D.
00320	Erman, and R.    B.   Neely concerning the  speech  recognition
00330	project at the IEEE Systems Science and Cybernetics Conference.
00340	At this time Professor Reddy left Stanford to join the  faculty
00350	of  Carnegie-Mellon  University and Dr.    Arthur Samuel became
00360	the head of the Stanford speech research efforts.    Dr. Samuel
00370	was  the  developer of an extremely successful machine learning
00380	scheme which  had  previously  been  applied  to  the  game  of
00390	checkers  [16],[17].     He  resolved  to  apply  it  to  speech
00400	recognition.
     

00010		By 1971 the first report on a speech recognition system
00020	utilizing  Samuel's learning scheme was written by George White
00030	[18].  This report was primarily concerned with the  properties
00040	of signature trees and the heuristics involved in applying them
00050	to select an optimal minimal set of features  for  recognition.
00060	Also at this time, M.M.
00070	Astrahan produced a report describing his  research  on  speech
00080	analysis by clustering, or the hyperphoneme method [19].   This
00090	process attempted speech  recognition  through  mathematical
00100	classification  rather  than  the  traditional   phonemic   or
00110	linguistic  categories.             This  was  accomplished  by
00120	nearest-neighbor classification in a hyperspace wherein cluster
00130	centers, or hyperphonemes, had been established.
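The nearest-neighbor step of the hyperphoneme method can be sketched in miniature (an illustrative example only; the labels, feature values, and cluster centers below are invented, not taken from Astrahan's report):

```python
import math

# Illustrative sketch of hyperphoneme-style classification: each incoming
# frame of acoustic measurements is assigned to the nearest cluster center
# ("hyperphoneme") in feature space.  The labels and coordinates here are
# invented for illustration.
hyperphonemes = {
    "HP1": (0.2, 0.8, 0.1),
    "HP2": (0.9, 0.1, 0.4),
    "HP3": (0.5, 0.5, 0.9),
}

def classify(frame):
    """Return the label of the nearest cluster center (Euclidean distance)."""
    return min(hyperphonemes,
               key=lambda label: math.dist(frame, hyperphonemes[label]))

print(classify((0.85, 0.15, 0.35)))  # -> HP2, the nearest center
```

In the actual method the cluster centers would first be established from training data; here they are simply given.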
00140	
00150		In 1972 R. B. Thosar and A. L. Samuel presented a report
00160	concerning  some  preliminary experiments in speech recognition
00170	using signature tables [20].      This approach  represented  a
00180	general   attack   on  speech  recognition  employing  learning
00190	mechanisms at each stage of classification.
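In miniature, the table mechanism might be sketched as follows (a hypothetical simplification, not the Thosar-Samuel implementation: here each cell, indexed by a tuple of quantized input features, accumulates positive and negative training counts and reports their normalized difference as its learned output):

```python
from collections import defaultdict

# A hypothetical, minimal signature table (not the actual Thosar-Samuel
# code): each cell, addressed by a tuple of quantized input features,
# tallies positive/negative training counts and reports their normalized
# difference as its learned output in [-1, 1].
class SignatureTable:
    def __init__(self):
        self.counts = defaultdict(lambda: [0, 0])  # cell -> [pos, neg]

    def train(self, features, is_positive):
        """Tally one labeled training sample into the addressed cell."""
        self.counts[tuple(features)][0 if is_positive else 1] += 1

    def evaluate(self, features):
        """Learned output for a feature tuple; 0.0 for an unseen cell."""
        pos, neg = self.counts[tuple(features)]
        total = pos + neg
        return (pos - neg) / total if total else 0.0

table = SignatureTable()
for feats, label in [((1, 0), True), ((1, 0), True), ((1, 0), False),
                     ((0, 1), False)]:
    table.train(feats, label)
print(table.evaluate((1, 0)))  # (2 - 1) / 3, about 0.33
```

Since the report describes learning mechanisms at each stage of classification, such tables would be cascaded, the outputs of one stage serving as inputs to the next.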
00200	
00210		The speech effort in  1973  has  been  devoted  to  two
00220	areas.      First,  a  mathematically  rigorous examination and
00230	improvement of the signature table learning mechanism has  been
00240	accomplished  by  R.B.  Thosar.   Second, a segmentation scheme
00250	based  on  signature  tables  is  being  developed  to  provide
00260	accurate segmentation together with probabilities or confidence
00270	values  for  the  most  likely  phoneme  occurring  during  each
00280	segment.   This process attempts to extract as much information
00290	about  an  acoustic  signal  as  possible  and  to  pass   this
00300	information to higher level processes.  The preliminary results
00310	of this segmentation scheme will be  presented  at  the  speech
00320	segmentation  workshop  to  be  held in July at Carnegie-Mellon
00330	University. In addition to these activities, a new, high  speed
00340	pitch  detection  scheme has been developed by J. A. Moorer and
00350	has been submitted for publication.
     

00010			BIBLIOGRAPHY
00020	
00030	1.	AIM-26, Raj  Reddy,  Experiments  on  Automatic  Speech
00040	Recognition by a Digital Computer, October 1964, 19 pages.
00050	
00060	2.	AIM-43,  Raj  Reddy,  An  Approach  to  Computer Speech
00070	Recognition  by  Direct  Analysis  of  the   Speech   Waveform,
00080	September 1966, 144 pages.
00090	
00100	3.	D.  Reddy, "Segmentation of Speech Sounds," J.  Acoust.
00110	Soc. Amer., August 1966.
00120	
00130	4.	D. Reddy, "Phoneme Grouping for Speech Recognition," J.
00140	Acoust. Soc. Amer., May, 1967.
00150	
00160	5.	D.    Reddy,  "Pitch  Period  Determination  of  Speech
00170	Sounds," Comm. ACM, June, 1967.
00180	
00190	6.	D. Reddy, "Computer Recognition of  Connected  Speech,"
00200	J. Acoust. Soc. Amer., August, 1967.
00210	
00220	7.	AIM-71,   Pierre   Vicens,   Preprocessing  for  Speech
00230	Analysis, October 1968, 33 pages.
00240	
00250	8.	D. Reddy, "Computer Transcription of Phonemic Symbols",
00260	J. Acoust. Soc. Amer., August 1968.
00270	
00280	9.	D.     Reddy,  and  Ann  Robinson, "Phoneme-To-Grapheme
00290	Translation  of   English",   IEEE   Trans.         Audio   and
00300	Electroacoustics, June 1968.
00310	
00320	10.	D.   Reddy, and P. Vicens, "Procedures for Segmentation
00330	of Connected Speech," J. Audio Eng. Soc., October 1968.
00340	
00350	11.	D.  Reddy, "Consonantal Clustering and Connected Speech
00360	Recognition",  Proc. Sixth International Congress of Acoustics,
00370	Vol. 2, pp. C-57 to C-60, Tokyo, 1968.
00380	
00390	12.	John McCarthy, Lester  Earnest,  D.    Raj  Reddy,  and
00400	Pierre  Vicens,  "A  Computer  With  Hands,  Eyes,  and  Ears",
00410	Proceedings of the Fall Joint Computer Conference, 1968.
     

00010	13.	AIM-85, Pierre Vicens, Aspects of Speech Recognition by
00020	Computer, April 1969, 210 pages.
00030	
00040	14.	AIM-78,  D.   Raj  Reddy,  On the Use of Environmental,
00050	Syntactic and Probabilistic Constraints in Vision  and  Speech,
00060	January 1969, 23 pages.
00070	
00080	15.	AIM-79,  D.    R.   Reddy and R.  B.  Neely, Contextual
00090	Analysis of Phonemes of English, January 1969, 71 pages.
00100	
00110	16.	A.  L.  Samuel, "Some Studies in Machine Learning Using
00120	the Game of Checkers," IBM Journal 3, 211-229 (1959).
00130	
00140	17.	A.  L.  Samuel, "Some Studies in Machine Learning Using
00150	the Game of Checkers, II - Recent Progress," IBM Jour. of  Res.
00160	and Dev., 11, pp. 601-617 (1967).
00170	
00180	18.	AIM-136,  George  M.    White, Machine Learning Through
00190	Signature Trees. Applications to Human Speech, October 1970, 40
00200	pages.
00210	
00220	19.	AIM-124,   M.    M.     Astrahan,  Speech  Analysis  by
00230	Clustering, or the Hyperphoneme Method, June 1970, 22 pages.
00240	
00250	20.	R.  B.  Thosar and A.   L.   Samuel,  Some  Preliminary
00260	Experiments   in   Speech  Recognition  Using  Signature  Table
00270	Learning, ARPA Speech Understanding Research Group Note 43.
00280	
00290	21.	R. B. Thosar, Estimation of Probability Densities using
00300	Signature Tables for Application to Pattern Recognition, ARPA
00310	Speech Understanding Research Group Note 81.